Integrating Learning and Planning

Model-Based Reinforcement Learning

Model-based reinforcement learning combines model learning with policy learning: the agent learns a model of the environment from experience and then uses that model to plan a value function or policy.

Model-Free RL

Model-Based RL

graph TD
    A(value/policy) -->|acting| B(experience)
    B -->|model learning| C(model)
    C -->|planning| A

Advantages of Model-Based RL

Disadvantages of Model-Based RL


What is the model?

A model $\mathcal{M}$ is a function that predicts the agent's next state and reward given the current state and action.

$$MDP = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \eta \rangle$$

We assume that the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are known. $\mathcal{P}$ is the transition probability function, $\mathcal{R}$ is the reward function, and $\eta$ is the discount factor.

Model Learning

The objective is to learn the model $\mathcal{M}$ from experience. This is a supervised learning problem.

Learning $s, a \rightarrow r$ is a regression problem. Learning $s, a \rightarrow s'$ is a density estimation problem.

A look-up table can be used to represent the model. For each state-action pair, the model stores the next state and reward.

$S_1 \times A_1 \rightarrow S_2 \times R_1$
$S_1 \times A_2 \rightarrow S_3 \times R_2$
$\vdots$

When a state-action pair is observed again, its table entry is updated by replacing the old estimate with the one that results in the higher reward. However, the table can become very large, and storing every state-action pair is not practical for large state or action spaces.

Another problem is the sample count. If a state-action pair has been observed only a few times, the estimate stored for it is unreliable, and the model introduces bias.
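
As an illustration of a look-up table model, here is a minimal sketch. It uses a different, commonly used update from the replace-on-higher-reward rule described above: it counts visits and stores empirical frequencies and mean rewards, which makes the dependence on sample count explicit. The class `TableModel` and its method names are hypothetical, and states and actions are assumed to be discrete and hashable.

```python
import random
from collections import defaultdict


class TableModel:
    """Count-based look-up table model (illustrative, names are hypothetical)."""

    def __init__(self):
        # next_counts[(s, a)][s_next] = how often s_next followed (s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)  # total reward observed after (s, a)
        self.visits = defaultdict(int)        # number of samples for (s, a)

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a, r, s_next)."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        """Empirical estimate of P(s' | s, a); empty dict if (s, a) is unseen."""
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.next_counts[(s, a)].items()}

    def expected_reward(self, s, a):
        """Empirical mean reward for (s, a); requires at least one sample."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

    def sample(self, s, a, rng=random):
        """Draw a simulated next state; return it with the mean reward."""
        probs = self.transition_probs(s, a)
        states = list(probs)
        s_next = rng.choices(states, weights=[probs[x] for x in states])[0]
        return s_next, self.expected_reward(s, a)


model = TableModel()
model.update("S1", "A1", 1.0, "S2")
model.update("S1", "A1", 0.0, "S2")
model.update("S1", "A1", 2.0, "S3")
model.update("S1", "A1", 1.0, "S2")
probs = model.transition_probs("S1", "A1")
mean_reward = model.expected_reward("S1", "A1")
```

With only four samples the estimated probabilities are coarse (multiples of 1/4), which is the small-sample problem mentioned above.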

Planning with inaccurate model

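Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$, the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$. In other words, model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a suboptimal policy.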
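
A toy example can make this concrete. The sketch below plans with value iteration on a true model and on an inaccurate one, then compares the resulting policies. The two-state MDP, its reward numbers, and the helper `value_iteration` are all made up for illustration and are not part of the notes.

```python
def value_iteration(P, R, discount=0.9, sweeps=100):
    """Plan on a tabular model and return the greedy policy {state: action}.

    P[s][a] is a dict {next_state: probability}; R[s][a] is the expected reward.
    States with no actions (empty P[s]) are treated as terminal.
    """
    V = {s: 0.0 for s in P}

    def q(s, a):
        return R[s][a] + discount * sum(p * V[s2] for s2, p in P[s][a].items())

    for _ in range(sweeps):
        for s in P:
            if P[s]:
                V[s] = max(q(s, a) for a in P[s])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P if P[s]}


# Hypothetical one-step problem: both actions end the episode.
P = {"start": {"safe": {"end": 1.0}, "risky": {"end": 1.0}}, "end": {}}
true_R = {"start": {"safe": 1.0, "risky": 0.5}}     # real expected rewards
learned_R = {"start": {"safe": 1.0, "risky": 2.0}}  # overestimate from few samples

policy_true = value_iteration(P, true_R)
policy_learned = value_iteration(P, learned_R)

# Reward actually obtained in the real environment under each plan
real_reward_true = true_R["start"][policy_true["start"]]
real_reward_learned = true_R["start"][policy_learned["start"]]
```

Planning on the inaccurate rewards selects `risky`, which is optimal for the approximate MDP but earns less than `safe` in the real one.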


Integrated Architectures

We consider two sources of experience: real experience and simulated experience.

Model-Free RL

Model-Based RL

Dyna

graph TD
    A(value/policy) -->|acting| B(experience)
    B -->|model-free learning| A
    B -->|model learning| C(model)
    C -->|planning| A
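
The loop in the diagram can be sketched in code. The following is a minimal tabular Dyna-Q, one common instantiation of the Dyna architecture, and not necessarily the exact variant covered in the course. It assumes a deterministic environment exposed through a hypothetical `step(state, action)` function returning `(reward, next_state, done)`. The corridor environment and all parameter values are made up for illustration.

```python
import random
from collections import defaultdict


def dyna_q(step, start, actions, episodes=30, planning_steps=10,
           alpha=0.5, discount=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)]
    model = {}              # model[(state, action)] = (reward, next_state, done)

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s2, done):
        target = r if done else r + discount * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)         # acting: real experience
            q_update(s, a, r, s2, done)      # model-free learning
            model[(s, a)] = (r, s2, done)    # model learning (deterministic table)
            for _ in range(planning_steps):  # planning: simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q


def corridor_step(s, a):
    """Hypothetical 5-cell corridor: reward 1 for reaching cell 4."""
    s2 = min(max(s + a, 0), 4)
    done = s2 == 4
    return (1.0 if done else 0.0), s2, done


Q = dyna_q(corridor_step, start=0, actions=[-1, 1])
```

Each real step feeds both the value function and the model, and the model then supplies `planning_steps` extra updates per real step, matching the two arrows into value/policy in the diagram.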

Once we have a model of the environment, we can use it to simulate experience. However, it is not practical to simulate the outcome of every possible action from every state. Instead, look-ahead search is used to find the best action starting from the current state $s_t$. This amounts to solving an MDP that starts from the current state.

Monte-Carlo Tree Search: Given a model $\mathcal{M}$, simulate $K$ episodes starting from the current state $s_t$ using the current simulation policy $\pi$. Build a search tree containing the visited states and actions, expanding it by selecting the best action at each state. Evaluate each state-action pair in the tree by the mean return of the simulated episodes that pass through it. After the search is finished, select the real action at $s_t$ with the highest value.

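Each simulation consists of two phases: an in-tree phase, where actions are selected by a tree policy that improves as the value estimates improve, and an out-of-tree phase, where a fixed default policy (for example, random actions) is followed until the episode ends.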

Repeat the simulation K times, and update the value of the states and actions.
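
The search procedure described above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes an epsilon-greedy tree policy, a uniformly random default policy, one new tree node per simulated episode, and a hypothetical `model(state, action)` function returning `(reward, next_state, done)`. The corridor model and parameter values are made up.

```python
import random
from collections import defaultdict


def mcts(model, root, actions, K=200, discount=0.9, epsilon=0.2,
         max_depth=20, seed=0):
    """Return (best root action, Q table) after K simulated episodes."""
    rng = random.Random(seed)
    N = defaultdict(int)    # visit counts per (state, action)
    Q = defaultdict(float)  # mean simulated return per (state, action)
    tree = {root}           # states currently stored in the search tree

    for _ in range(K):
        s, done, expanded = root, False, False
        path, rewards = [], []
        while not done and len(rewards) < max_depth:
            in_tree = s in tree
            if not in_tree and not expanded:
                tree.add(s)  # grow the tree by one state per episode
                in_tree = expanded = True
            if in_tree and rng.random() >= epsilon:
                a = max(actions, key=lambda b: Q[(s, b)])  # tree policy: greedy
            else:
                a = rng.choice(actions)  # exploration or default policy
            if in_tree:
                path.append((s, a, len(rewards)))
            r, s, done = model(s, a)
            rewards.append(r)

        # Discounted return from every time step of the episode
        G, returns = 0.0, [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            G = rewards[t] + discount * G
            returns[t] = G
        # Back up: Q is the running mean of returns observed after (s, a)
        for s, a, t in path:
            N[(s, a)] += 1
            Q[(s, a)] += (returns[t] - Q[(s, a)]) / N[(s, a)]

    best = max(actions, key=lambda a: Q[(root, a)])
    return best, Q


def corridor_model(s, a):
    """Hypothetical corridor: cell 0 ends with reward -1, cell 4 with +1."""
    s2 = s + a
    if s2 == 0:
        return -1.0, s2, True
    if s2 == 4:
        return 1.0, s2, True
    return 0.0, s2, False


best_action, Q = mcts(corridor_model, root=1, actions=[-1, 1])
```

Only states inside the tree keep value estimates. Everything after the episode leaves the tree is used purely to generate a return for the backup.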

Temporal-Difference Tree Search: The idea is to apply TD learning to the simulated experience, so the values of states and actions are updated by bootstrapping from the next estimate rather than from the full simulated return. This differs from MC tree search in that the aim here is to use a function approximator to estimate the values of states and actions.
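
The two backups can be written side by side. These are the standard Monte-Carlo and Sarsa-style forms, given here as an assumption about which updates are meant. The notation is introduced for illustration: $N(S_t, A_t)$ is a visit count, $G_t$ is the simulated return, $\alpha$ is a step size, and $\eta$ is the discount factor from the MDP definition.

$$
\begin{aligned}
\text{MC tree search:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \big( G_t - Q(S_t, A_t) \big) \\
\text{TD tree search:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \eta \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big)
\end{aligned}
$$

With a function approximator, the same TD error is used to adjust the approximator's parameters instead of a single table entry.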

Advantages of MC Tree Search

Advantages of TD Tree Search


#MMI706 - Reinforcement Learning at METU